I can use PyMC3 to estimate the posterior PDF for the true skill level of every BUDA team using its W-L record during the season. This could be superior to my initial approach of assuming a discrete skill level based on Division and then modifying that skill level based on individual game outcomes. It will also serve as a great playground in which to explore PyMC3.
Now that I've tested things out on a small sample of artificial game data, it's time to run this on BUDA summer club league data.
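The core idea is a Bradley-Terry-style model: the probability that team A beats team B is a logistic function of the difference in their latent skills. A minimal sketch of that win-probability function (the function name is mine, not from the notebook):

```python
import numpy as np

def p_win(skill_a, skill_b):
    """Logistic win probability implied by a difference in latent skill."""
    return 1.0 / (1.0 + np.exp(-(skill_a - skill_b)))

# Equal skills give a coin flip; a one-unit edge wins ~73% of the time.
print(p_win(0.0, 0.0))           # 0.5
print(round(p_win(1.0, 0.0), 2)) # 0.73
```

The model below is exactly this, with the skills treated as random variables whose posterior is inferred from the observed win/loss outcomes.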
In [2]:
import pandas as pd
import os
import numpy as np
import pymc3 as pm
from pymc3.math import invlogit
from tqdm import tqdm
from theano import tensor as tt
%matplotlib inline
In [2]:
project_dir = '/Users/rbussman/Projects/BUDA/buda-ratings'
scores_dir = os.path.join(project_dir, 'data', 'raw', 'game_scores')
In [3]:
# summer club league 2016
league_id = 40264
game_scores = pd.read_csv(os.path.join(scores_dir, 'scores_{}.csv'.format(league_id)))
In [4]:
game_scores.head()
Out[4]:
In [6]:
game_scores['Team A Wins'] = game_scores['Score A'] > game_scores['Score B']
In [7]:
game_scores.tail()
Out[7]:
In [8]:
teams = game_scores['Team A'].unique()
In [9]:
n_teams = len(teams)
In [10]:
game_scores.head()
Out[10]:
In [11]:
team2index = {}
team2div = {}
for i, team in enumerate(teams):
    row = game_scores['Team A'] == team
    div_team = game_scores.loc[row, 'divname'].unique()[0]
    team2div[team] = div_team
    team2index[team] = i
In [12]:
game_scores['Index A'] = game_scores['Team A'].apply(lambda x: team2index[x])
game_scores['Index B'] = game_scores['Team B'].apply(lambda x: team2index[x])
game_scores['Div A'] = game_scores['Team A'].apply(lambda x: team2div[x])
game_scores['Div B'] = game_scores['Team B'].apply(lambda x: team2div[x])
In [13]:
sub_scores = game_scores.copy()
for div in ['Open Div 1', 'Open Div 2']:
    sub1 = (sub_scores['Div A'] == div) | (sub_scores['Div B'] == div)
    sub_scores = sub_scores[~sub1]
In [14]:
sub_scores.tail()
Out[14]:
In [15]:
n_teams = len(sub_scores['Team A'].unique())
In [16]:
n_teams
Out[16]:
In [17]:
len(sub_scores)
Out[17]:
Prior on each team is a normal distribution with mean of 1600 and standard deviation of 200.
My first attempt with 78 teams and 1166 game scores leads to a compilation error: "fatal error: bracket nesting level exceeded maximum of 256". I will try to get around this by filtering out the Open teams and de-duplicating the game scores database; currently every game score is present twice.
Removing the Open teams and de-duplicating gets me down to 67 teams and 527 scores, but even then I still hit compilation errors or my computer restarts. Trimming down to the top 30 teams is manageable, so for now I'll stick with the top 30 and plan to come back later to explore optimization options.
The most pressing issue now is that Div 2 teams get skill estimates similar to Div 1 teams. I need to use a different prior on skill for each division: roughly speaking, I expect Div 1 teams to be strongest, Div 2 teams weaker, and Div 3 teams weaker still.
I posted on the PyMC3 discourse page and got a helpful response from @aseyboldt: it turns out I can just index skill with an array of integers and get the speed boost that I need.
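The fix amounts to replacing a per-game Python loop, which bakes one subtraction node per game into the compiled graph (hence the bracket-nesting blowup), with a single fancy-indexing operation. A NumPy sketch of the idea, with hypothetical skills and team indices:

```python
import numpy as np

# Hypothetical data: one latent skill per team, and per-game team indices.
skill = np.array([1.5, -0.5, 0.25, 0.0])
team1 = np.array([0, 2, 1])   # "Team A" index for each game
team2 = np.array([1, 3, 2])   # "Team B" index for each game

# One vectorized subtraction replaces the per-game loop, so the compiled
# graph contains a single op instead of one node per game.
logit_p = skill[team1] - skill[team2]  # [2.0, 0.25, -0.75]
```

The same indexing works on a PyMC3/Theano random vector like `skill` in the model below, which is what makes the full 30-team model compile.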
In [18]:
pair_list = []
for row in tqdm(sub_scores.index):
    team_A = sub_scores.loc[row, 'Index A']
    team_B = sub_scores.loc[row, 'Index B']
    new_pair = (team_A, team_B)
    pair_list.append(new_pair)
    reverse_pair = (team_B, team_A)
    if reverse_pair in pair_list:
        sub_scores = sub_scores.drop(row)
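As an aside, the row-by-row drop above is quadratic in the number of games. A vectorized pass that produces the same de-duplication (assuming, as noted above, that each game appears exactly twice with the team roles swapped) could look like this, shown here on a toy stand-in for `sub_scores`:

```python
import pandas as pd

# Toy stand-in: game (0, 1) appears twice with the A/B roles swapped.
games = pd.DataFrame({'Index A': [0, 1, 2, 1],
                      'Index B': [1, 0, 3, 2]})

# An order-free key per matchup: (0, 1) and (1, 0) map to the same frozenset,
# so duplicated() flags the mirrored copy and keeps the first occurrence.
pair_key = games[['Index A', 'Index B']].apply(frozenset, axis=1)
deduped = games[~pair_key.duplicated()]  # rows 0, 2, 3 survive
```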
In [19]:
len(sub_scores)
Out[19]:
In [20]:
sub_scores = sub_scores.reset_index().drop('index', axis=1)
In [21]:
sub_scores.tail()
Out[21]:
In [22]:
n_teams
Out[22]:
In [23]:
sub_scores
Out[23]:
In [24]:
teams = set(np.append(sub_scores['Team A'].unique(), sub_scores['Team B'].unique()))
In [25]:
len(teams)
Out[25]:
In [80]:
skill_prior_div = {
    '4/3 Div 1': 3.0,
    '4/3 Div 2': 0.0,
    '4/3 Div 3': -1.0,
    '5/2 Div 1': 3.0,
    '5/2 Div 2': 0.0,
    '5/2 Div 3': -1.0,
}
alphas = []
for i in range(len(teams)):
    if i in sub_scores['Index A'].values:
        index = sub_scores['Index A'] == i
        div = sub_scores.loc[index, 'Div A'].unique()[0]
    else:
        index = sub_scores['Index B'] == i
        div = sub_scores.loc[index, 'Div B'].unique()[0]
    alpha = skill_prior_div[div]
    alphas.append(alpha)
In [82]:
team1 = sub_scores['Index A'].values
team2 = sub_scores['Index B'].values
with pm.Model() as model:
skill_sd = pm.HalfStudentT('skill_sd', sd=2.5, nu=3)
skill = pm.Normal('skill', mu=alphas, shape=len(teams))
logit_p = skill_sd * (skill[team1] - skill[team2])
# logit_p = skill[team1] - skill[team2]
# lower = 1e-6
# upper = 1 - 1e-6
# B_minus_A = skill[team2] - skill[team1]
# probability_A_beats_B = lower + (upper - lower) * 1 / (1 + tt.exp(B_minus_A))
p = tt.nnet.sigmoid(logit_p)
win = pm.Bernoulli('win', p, observed=sub_scores['Team A Wins'].values)
In [83]:
with model:
    trace = pm.sample(1000)
In [84]:
pm.traceplot(trace)
Out[84]:
In [85]:
trace.varnames
Out[85]:
In [86]:
meanskills = trace.get_values('skill').mean(axis=0)
In [87]:
for i, meanskill in enumerate(meanskills):
    if i in sub_scores['Index A'].values:
        index = sub_scores['Index A'] == i
        name = sub_scores.loc[index, 'Team A'].unique()[0]
    else:
        index = sub_scores['Index B'] == i
        name = sub_scores.loc[index, 'Team B'].unique()[0]
    print("{}: {:.3f}".format(name, meanskill))
In [73]:
i in sub_scores['Index A'].values
Out[73]:
In [69]:
index = sub_scores['Index A'] == i
In [71]:
name = sub_scores.loc[index, 'Team A'].unique()
In [72]:
name
Out[72]:
In [172]:
i in sub_scores['Index A']
Out[172]:
In [175]:
sub_scores['Index B'].unique()
Out[175]:
In [60]:
# Earlier, loop-based attempt that triggered the compilation error.
# (Note: `mixed_scores` is from a previous session and is not defined above.)
with pm.Model() as model:
    skill = pm.Cauchy('skill', alpha=alphas, beta=0.5, shape=len(teams))
    # skill_div3 = pm.Cauchy('skill_div3', alpha=-1.0, beta=0.5, shape=n_teams3)
    # skill_div4 = pm.Cauchy('skill_div4', alpha=-3.0, beta=0.5, shape=n_teams4)
    # skill = pm.Lognormal('skill', mu=0, tau=1, shape=n_teams + 2)
    # skill = pm.Exponential('skill', lam=1, shape=n_teams + 2)
    # scale = pm.HalfNormal('scale', sd=100)
    A_minus_B = []
    B_minus_A = []
    for row in mixed_scores.index:
        team_A = mixed_scores.loc[row, 'ID A']
        team_B = mixed_scores.loc[row, 'ID B']
        div_A = mixed_scores.loc[row, 'divname']
        # print(team_A, team_B)
        A_minus_B.append(skill[team_A] - skill[team_B])
        B_minus_A.append(skill[team_B] - skill[team_A])
    # A_minus_B = np.array(A_minus_B)
    lower = 1e-6
    upper = 1 - 1e-6
    probability_A_beats_B = lower + (upper - lower) * 1 / (1 + tt.exp(B_minus_A))
    # probability_A_beats_B = pm.math.invlogit(A_minus_B)
    # probability_A_beats_B = 1. / (1 + pm.math.exp(B_minus_A))
    observation = pm.Bernoulli('observation', probability_A_beats_B, observed=mixed_scores['Team A Wins'].values)
In [38]:
with pm.Model() as model:
    # skill = pm.Cauchy('skill', alpha=alphas, beta=0.5, shape=len(teams))
    skill = pm.Normal('skill', sd=0.5, shape=len(teams))
    B_minus_A = skill[sub_scores['Index B'].values] - skill[sub_scores['Index A'].values]
    lower = 1e-6
    upper = 1 - 1e-6
    probability_A_beats_B = lower + (upper - lower) * 1 / (1 + tt.exp(B_minus_A))
    # probability_A_beats_B = pm.math.invlogit(A_minus_B)
    # probability_A_beats_B = 1. / (1 + pm.math.exp(B_minus_A))
    observation = pm.Bernoulli('observation', probability_A_beats_B, observed=sub_scores['Team A Wins'].values)
In [52]:
i
Out[52]:
In [7]:
xplot = np.arange(-15, 15, 0.1)
In [8]:
yplot = 1 / (1 + np.exp(-xplot))
In [9]:
import matplotlib.pyplot as plt
plt.plot(xplot, yplot)
Out[9]:
In [ ]: